RNAseq (part 1)

RNA-seq

RNA-seq offers a method to simultaneously measure the genome wide expression of annotated and novel transcripts.

igv

RNA-seq for differential gene expression

A common goal in RNA-seq is to identify genes and/or transcripts which are expressed at differing levels between our user defined sample groups.

Absolute quantification of the expression level of gene is subject to measurement biases (GC bias in PCR and/or RNA capture/selection) so it is most appropriate to compare the levels of the expression of same gene across conditions than to compare differing genes’ expression levels within a condition.

This commonly referred to as differential gene or transcript expression (DGE/DTE).

igv

RNA-seq for differential transcript usage

A less frequent but important goal in RNAseq is to identify any changes in the relative abundance of transcripts for a gene between conditions. This may be able to identify changes in a genes’ functional role between conditions. i.e. One isoform of a protein may bind and activate, another may simply bind and obstruct other active forms. We call this differential transcript usage (DTU).

igv

The data

In this session we will be reviewing data from Christina Leslie’s lab at MSKCC on T-Reg cells. This can be found on the Encode portal here

FastQ for TReg RNAseq replicate 1 used in this session can be downloaded here

FastQ for TReg RNAseq replicate 2 used in practical can be downloaded here

For differential splicing and transcript analysis we will use one of our datasets from a splicing factor Knockdown RNAseq experiment found at GEO here.

What we will cover

In the RNAseq sessions we will identify differential gene expression between our T-cell sample groups as well identify potential differential transcript usage. We will further identify biological functions associated with genes and visualize these in IGV and in summary plots.

We will also identify differentially used transcript and exons from our splice factor dataset and visualise these in IGV.

In our first session we will look at two separate ways we can gain gene expression estimates from raw sequencing data as fastQ files.

Working with raw RNAseq data

Working with raw RNAseq data

Once we have the raw fastQ data we can use the ShortRead package to review our sequence data quality.

We have reviewed how to work with raw sequencing data in the FastQ in Bioconductor session.

Reading raw RNAseq data

We can subsample from a fastQ file using functions in ShortRead package.

Here we use the FastqSampler and yield function to randomly sample a defined number of reads from a fastQ file. Here we subsample 1 million reads. This should be enough to to do assess the quality of the sequencing.

## class: ShortReadQ
## length: 1000000 reads; width: 50 cycles

Raw RNAseq data QC

Now we have our ShortRead object we can produce some of our more informative plots of fastQ properties.

First we can extract the read sequences, using the sread() function, and retrieve a matrix of base pair abundance over sequencing cycles (or length of read) using the alphabetByCycle() function.

##         cycle
## alphabet   [,1]   [,2]   [,3]   [,4]
##        A 292748 275697 261728 263146
##        C 264301 218260 246023 244318
##        G 207470 253398 239248 237533
##        T 234995 252593 252999 255001

Raw RNAseq data QC

We can extract information on reads quality scores using the quality() function and translate this into a useful score summary per read using the alphabetScore() function.

We plot the distribution of quality scores to identify whether low quality reads should be filtered.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Raw RNAseq data QC

A final essential check of fastq quality is to plot the quality of sequencing over the length of the read (and so over time). Here we can use the as(MYQUALITIES,“matrix”) function to translate our ASCII encoded scores into numeric values and create a boxplot for visualisation.

Aligning RNAseq data

Aligning RNAseq reads

Following assessment of read quality, we will want to align our reads to the genome taking into account splice junctions.

Since RNAseq reads will not all align continously agaist our reference genome but will map across splice junctions we will use our splice aware aligners we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.

igv

Creating a reference genome

First we need to retrieve the sequence information for the genome of interest in FASTA format

We can use the BSgenome libraries to retrieve the full sequence information.

For the mouse mm10 genome we load the package BSgenome.Mmusculus.UCSC.mm10.

## Mouse genome:
## # organism: Mus musculus (Mouse)
## # provider: UCSC
## # provider version: mm10
## # release date: Dec. 2011
## # release name: Genome Reference Consortium GRCm38
## # 66 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrUn_GL456372       chrUn_GL456378       chrUn_GL456379      
## #   chrUn_GL456381       chrUn_GL456382       chrUn_GL456383      
## #   chrUn_GL456385       chrUn_GL456387       chrUn_GL456389      
## #   chrUn_GL456390       chrUn_GL456392       chrUn_GL456393      
## #   chrUn_GL456394       chrUn_GL456396       chrUn_JH584304      
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

Creating a reference genome

We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a DNAStringSet object from the retrieved sequences.

## DNAStringSet object of length 22:
##          width seq                                          names               
##  [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1
##  [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2
##  [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3
##  [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4
##  [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5
##  ...       ... ...
## [18]  90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18
## [19]  61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19
## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX
## [21]  91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY
## [22]     16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM

Creating an Rsubread index

We will be aligning using the subjunc algorithm from the folks behind subread. We can therefore use the Rsubread package. Before we attempt to align our fastq files, we will need to first build an index from our reference genome using the buildindex() function.

The buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from.

REMEMBER: Building an index is memeory intensive and by default is set to 8GB. This may be too large for your laptop or Desktop.

Rsubread RNA-seq alignment

We can align our raw sequence data in fastQ format to the new FASTA file of our mm10 genome sequence using the Rsubread package. Specifically we will be using the subjunc function as it is splice aware. This means it will detect reads that span introns. This is the major difference between alignment of RNA-seq and other gneomic technologies like DNA-seq and ChIP-seq where we will use the align function.

We can also provide a SAF or GTF to our Rsubread call. Although largely unnecessary for gene expression estimation, this will allow us to capture non-canonical splice sites.

We simply need to provide a SAF/gtf to annot.ext parameter and set useAnnotation and isGTF to FALSE/TRUE depending on whether we use SAF or GTF as external data.

SAF format using exons().

SAF format is used by RSubread to hold feature information. We simply need a table of exons’ chromosome locations (chromosome,start,end,strand) and a feature/metafeature ID.

We can retrieve exon locations and their gene ids using exons() function. We further select only exons which are annotated to exactly 1 gene.

## GRanges object with 376333 ranges and 2 metadata columns:
##                  seqnames          ranges strand |         tx_id
##                     <Rle>       <IRanges>  <Rle> | <IntegerList>
##        [1]           chr1 4807788-4807982      + |            14
##        [2]           chr1 4807823-4807982      + |            15
##        [3]           chr1 4807830-4807982      + |            16
##        [4]           chr1 4807892-4807982      + |            17
##        [5]           chr1 4807896-4807982      + |            18
##        ...            ...             ...    ... .           ...
##   [376329] chrUn_JH584304     55112-55701      - | 142445,142446
##   [376330] chrUn_JH584304     56986-57151      - | 142445,142446
##   [376331] chrUn_JH584304     58564-58835      - |        142445
##   [376332] chrUn_JH584304     58564-59690      - |        142446
##   [376333] chrUn_JH584304     59592-59667      - |        142445
##                    gene_id
##            <CharacterList>
##        [1]           18777
##        [2]           18777
##        [3]           18777
##        [4]           18777
##        [5]           18777
##        ...             ...
##   [376329]           66776
##   [376330]           66776
##   [376331]           66776
##   [376332]           66776
##   [376333]           66776
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

RNA alignment with external annotation.

Now we have our SAF formatted exon information we can provide the SAF data.frame to the annot.ext argument, set isGTF as FALSE and set useAnnotation to TRUE.

Sort and index reads

As before, we sort and index our files using the Rsamtools packages sortBam() and indexBam() functions respectively.

The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R.

igv

Counting with aligned RNAseq data

Counting in gene models

With our newly aligned reads it is now possible to assign reads to genes in order to quantify a genes’ expression level (as reads) within a sample.

First we need to gather our gene models of exons and splice junctions which we can use in our later counting steps.

## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

Counting in gene models

Now we have our GRangesList with each entry corresponding to a gene and within each entry a GRanges object of the genes’ exons.

## GRangesList object of length 2:
## $`100009600`
## GRanges object with 9 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr9 21062393-21062717      - |    233853        <NA>
##   [2]     chr9 21062400-21062717      - |    233855        <NA>
##   [3]     chr9 21062894-21062987      - |    233856        <NA>
##   [4]     chr9 21063314-21063396      - |    233857        <NA>
##   [5]     chr9 21066024-21066377      - |    233858        <NA>
##   [6]     chr9 21066940-21067093      - |    233859        <NA>
##   [7]     chr9 21066940-21067925      - |    233860        <NA>
##   [8]     chr9 21068030-21068117      - |    233867        <NA>
##   [9]     chr9 21073075-21073096      - |    233869        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
## 
## $`100009609`
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr7 84935565-84941088      - |    190288        <NA>
##   [2]     chr7 84943141-84943264      - |    190289        <NA>
##   [3]     chr7 84943504-84943722      - |    190290        <NA>
##   [4]     chr7 84943504-84947000      - |    190291        <NA>
##   [5]     chr7 84946200-84947000      - |    190292        <NA>
##   [6]     chr7 84947372-84947651      - |    190293        <NA>
##   [7]     chr7 84948507-84949184      - |    190294        <NA>
##   [8]     chr7 84963816-84964115      - |    190295        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in gene models

We can now use the summarizeOverlaps() to count the reads in our BAM that overlap genes. We specify a BamFile object using the BamFile() function and specifying the yieldSize parameter to 10000 to control memory footprint.

The resulting RangedSummarizedExperiment object containing our counts and GRanges object.

## class: RangedSummarizedExperiment 
## dim: 24594 1 
## metadata(0):
## assays(1): counts
## rownames(24594): 100009600 100009609 ... 99929 99982
## rowData names(0):
## colnames(1): Sorted_Treg_1.bam
## colData names(0):

Counting in exons models

If want to start to evaluate differential transcript usage we will often evaluate counts over exons and assess for changes in exon usage.

igv

Counting in exons models

To count over exons however we will encounter an issue of assigning reads to overlapping exons.

igv

Counting in exons models

We can then work to differentiate exons by collapsing exons to the nonoverlapping, disjoint regions.

igv

Counting in exons models

We can use the GenomicFeatures’ package’s disjointExons() function with our TxDb object, TxDb.Mmusculus.UCSC.mm10.knownGene, to extract a GRanges of our disjoint exons with their associated gene_id,tx_name,exonic_part in the metadata columns.

We add some sensible names to our GRanges for use in later steps.

## GRanges object with 3 ranges and 3 metadata columns:
##               seqnames            ranges strand |         gene_id
##                  <Rle>         <IRanges>  <Rle> | <CharacterList>
##   100009600_1     chr9 21062393-21062399      - |       100009600
##   100009600_2     chr9 21062400-21062717      - |       100009600
##   100009600_3     chr9 21062894-21062987      - |       100009600
##                                                                      tx_name
##                                                              <CharacterList>
##   100009600_1                      ENSMUST00000115494.2,ENSMUST00000216967.1
##   100009600_2 ENSMUST00000115494.2,ENSMUST00000216967.1,ENSMUST00000213826.1
##   100009600_3                      ENSMUST00000115494.2,ENSMUST00000213826.1
##               exonic_part
##                 <integer>
##   100009600_1           1
##   100009600_2           2
##   100009600_3           3
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in exons models

We can now again use the summarizeOverlaps() function with our nonOverlapping, disjoint exons to identify a BAM file of interest.

We need to set the inter.feature to FALSE to allow us to count reads overlapping multiple features (such as a read spanning between two exons).

Counting in exons models

Now we can review our counts from our gene-level and exon-level counting.

We retrieve a matrix of counts from either RangedSummarizedExperiment object using the assay() function.

##             Sorted_Treg_1.bam
## 100009600_1                 0
## 100009600_2                 1
##           Sorted_Treg_1.bam
## 100009600                 1
## 100009609                 0

Transcript quantification with pseudo-alignment

Transcript quantification

More recently methods have developed to directly quantify transcript abundance from fastq files using k-mer counting.

k-mer counting offers a super fast method of gaining transcript abundance estimates, but does not produce a genomic alignment so not so useful for visualization.

As we are most often interested in gene expression changes this offers a fast alternative to counting for alignment and counting.

The two leading softwares for k-mer counting

  • Salmon - Relationship with DESeq2.

  • Kallisto - History of splice aware aligners and transcript quantification.

Salmon

Salmon is the latest in a set of k-mer counting tools from the Kingsford lab (Jellyfish and Sailfish) and offers a workflow for transcript quantification from fastq raw seqeuncing data and a fasta file of transcript sequences.

Source code is available for all systems and binaries available for Macs and Linux at their github.. Instructions for getting salmon through conda or docker images and available here

Reference for transcript quantification.

First we need to produce a FASTA file of transcript sequences. We can use the GenomicFeatures’ packages extractTranscriptSeqs function to retrieve a DNAstringSet object of transcript sequences using our sequence data from BSgenome.Mmusculus.UCSC.mm10 and gene models from TxDb.Mmusculus.UCSC.mm10.knownGene object.

## DNAStringSet object of length 142446:
##           width seq                                         names               
##      [1]   1070 AAGGAAAGAGGATAACACTT...GGTATAAAGTTAGAGAAATG ENSMUST00000193812.1
##      [2]    110 GTGCTTGCTTCGGCAACACA...ATATTGAGTAACAAAAATAT ENSMUST00000082908.1
##      [3]    480 TTCTTCTGTGTCTGAGCTAT...GGGATAAATTTACCTTCAAT ENSMUST00000192857.1
##      [4]    250 TGAAAATGGATAGCAGTTCC...TGATCAAATGATTACTCCCA ENSMUST00000161581.1
##      [5]    926 ATGGCATCGTTGGCAGACCG...AAGACAGTAAGAGAGAGTAA ENSMUST00000192183.1
##      ...    ... ...
## [142442]     99 TGTGTTGTCACAGTGCTCCA...TCCACTGTGCTGTCACAGTG ENSMUST00000184505.1
## [142443]    101 GAAATGCCTCCATGAGATCC...TCTCTGGGCTGGTAGTCTTG ENSMUST00000178705.1
## [142444]    100 GAAATGCCTCCATGAGATCC...ATCTCTGGGCTGGTAGTCTT ENSMUST00000180206.1
## [142445]   2373 AGCTGTCCCGGGGACCACAG...AAGTACTTCGAGAGAATGGC ENSMUST00000179505.7
## [142446]   4060 CTCTCTGCTGCCGGAGCAAG...TTGCAAGACATTTTGAAATT ENSMUST00000178343.1

Reference for transcript quantification

With our new DNAstringSet object of transcript sequences we can write a FASTA file for use in Salmon with the writeXStringSet function.

Salmon index from R

As with standard aligners, the first part of our quantification requires us to make an index from our FASTA file using the Salmon index command. We specify the transcript fasta file (-t) and index name (-i).

Here we arrange our Salmon command into a system call from R to allow us to loop through files.

~/bin/salmon index -i mm10Trans -t mm10Trans.fa

Salmon quant from R

To quantify transcript abundance we use the Salmon quant command. For Mac we may need to specify a library location as seen here.

We specify the index location (-i), the reads to quantify (-r), the output directory (-o) and the library type as automatic detection of type (-l A).

~/bin/salmon quant -i mm10Trans -r ~/Downloads/ENCFF332KDA.fastq.gz -o TReg_1_Quant -l A

Salmon outputs

The output from Salmon is our quantification of transcripts in the quant.sf file in the user defined output directory.

We will use a new package tximport to handle these counts in next session but for now we can review file in standard R libraries.

##                   Name Length EffectiveLength    TPM NumReads
## 1 ENSMUST00000193812.1   1070         820.000 0.0000        0
## 2 ENSMUST00000082908.1    110           3.749 0.0000        0
## 3 ENSMUST00000192857.1    480         230.000 0.8368        7

Time for an exercise

Link_to_exercises

Link_to_answers